Segmentation Granularity in Dependency Representations for Korean
Abstract
Previous work on Korean language processing has proposed several different basic segmentation units. This paper explores possible dependency representations for Korean at different levels of segmentation granularity, that is, under different schemes for morphologically segmenting tokens into syntactic words. We provide a new Universal Dependencies (UD)-like corpus for Korean built at each of these levels. The corpus contains 67K words in 5,000 sentences, split into training, development and evaluation sets. We report parsing results on the new dependency corpus and compare them with the previous Korean UD corpus.

1 Dependency Parsing and the Korean Language

Language processing for Korean, including morphological analysis, has traditionally been based on the eojeol, a basic segmentation unit delimited by blank spaces in the sentence. Let us consider the sentence in (1), which contains ten eojeols (the corresponding morphological analysis is found in Figure 1). The number of eojeols is determined entirely by the blank space character, so the tenth eojeol in (1) also includes the punctuation mark. Almost all natural-language processing systems previously developed for Korean have used the eojeol as the fundamental unit of analysis.

As Korean is an agglutinative language, joining content and functional morphemes is very productive, and they can be combined in exponentially many ways. For example, yeoghal (‘role’) is a content morpheme (a common noun) and -eul, a case marker (‘ACC’, accusative), is a functional morpheme. Together they form a single eojeol, yeoghal-eul (‘role + ACC’). Likewise, the predicate gangjoha-ass-da (‘focused’) consists of the content morpheme gangjo-ha (‘focus’) and its functional morphemes -ass (‘PAST’, past tense) and -da (‘IND’, indicative).

In this paper, we analyze different levels of segmentation granularity in dependency representations for syntactic annotation (§2). We then propose a scheme to build a new Universal Dependencies (UD)-like corpus for Korean based on segmentation granularity (§3). UD has been developed cross-linguistically using a consistent treebank annotation scheme for many languages. We provide 5,000 sentences for each of the segmentation granularity levels described in this paper. We also present parsing results on the new corpus, compare them with the previously proposed UD corpus for Korean (§4), and discuss future perspectives for dependency annotation and parsing of Korean (§5).

2 Segmentation Granularity for Korean

We define the following four different levels of segmentation granularity for Korean. These granularity levels have been independently proposed in previous work on Korean language processing as different basic segmentation units.
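The contrast between eojeol-level and morpheme-level units described in Section 1 can be illustrated with a small sketch. It is not the paper's tooling and does not reproduce its four granularity levels: the function names are hypothetical, the input is a toy fragment built from the two romanized eojeols quoted above (not sentence (1)), and morpheme boundaries are assumed to be pre-annotated with hyphens, as in the paper's examples.

```python
def split_eojeols(sentence: str) -> list[str]:
    """Split a sentence into eojeols, i.e. units delimited by whitespace.

    Punctuation stays attached to the neighbouring eojeol, mirroring the
    traditional eojeol-based segmentation.
    """
    return sentence.split()


def split_morphemes(eojeol: str, boundary: str = "-") -> list[str]:
    """Split one eojeol at pre-annotated morpheme boundaries.

    Assumption: boundaries are marked with a hyphen in the romanized input.
    Real Korean text would need a morphological analyzer instead.
    """
    return [m for m in eojeol.split(boundary) if m]


def segment(sentence: str) -> list[list[str]]:
    """Return the morphemes of each eojeol, keeping eojeol grouping."""
    return [split_morphemes(e) for e in split_eojeols(sentence)]


# Toy fragment: 'role + ACC' followed by 'focus + PAST + IND' and a period.
example = "yeoghal-eul gangjoha-ass-da."

print(split_eojeols(example))  # coarsest view: two eojeols
print(segment(example))        # finest view: morphemes per eojeol
```

At the eojeol level the fragment yields two units, while the morpheme-level view yields five, with the final period still attached to the last morpheme. Whether punctuation and functional morphemes are split off is exactly the kind of choice that a segmentation granularity level has to fix.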
Similar resources
Neuron-level Selective Context Aggregation for Scene Segmentation
Contextual information provides important cues for disambiguating visually similar pixels in scene segmentation. In this paper, we introduce a neuron-level Selective Context Aggregation (SCA) module for scene segmentation, comprised of a contextual dependency predictor and a context aggregation operator. The dependency predictor is implicitly trained to infer contextual dependencies between dif...
Granularity in the Cross-linguistic Encoding of Motion and Location
In this study, we explore three ways in which the notion of "granularity" emerges from the study of cross-linguistic event semantics. The first interpretation of granularity has to do with event segmentation for linguistic expressions. Where humans place event boundaries varies depending on the language and cultural setting in which the event is encoded. Second, within the set boundaries of a ‘...
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a way of syntactic parsing and a natural language that automatically analyzes the dependency structure of sentences, and the input for each sentence creates a dependency graph. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...
Improving Dependency Parsers with Supertags
Transition-based dependency parsing systems can utilize rich feature representations. However, in practice, features are generally limited to combinations of lexical tokens and part-of-speech tags. In this paper, we investigate richer features based on supertags, which represent lexical templates extracted from dependency structure annotated corpus. First, we develop two types of supertags that...
Using Dependency Parses to Augment Feature Construction for Text Mining
(ABSTRACT) With the prevalence of large data stored in the cloud, including unstructured information in the form of text, there is now an increased emphasis on text mining. A broad range of techniques are now used for text mining, including algorithms adapted from machine learning, NLP, computational linguistics, and data mining. Applications are also multi-fold, including classification, clust...